Prosper Loan Exploratory Data Analysis by Brandon Brown
Prosper is a peer-to-peer lending marketplace in the United States which provides a platform for borrowers and lenders (Reference 1). Borrowers can apply for a fixed-rate, fixed-term loan of between $2,000 and $35,000 and individuals/institutions can invest in loans which generate attractive returns (Reference 1). Data is available on Prosper loans (Reference 2). This data includes interest rates, borrower employment status, borrower income, loan amount, borrower credit history and many other variables (Reference 3). The purpose of this project is to explore the data to determine which conditions might be associated with whether a borrower will be successful in paying back the original loan amount and interest. I expect to see which borrower employment status, credit score, debt to income ratio and other loan variables are most associated with those who default on their loans and who successfully complete their loan payments.
Univariate Analysis
First, I will perform some univariate analyses, plotting bar charts and histograms of various variables of interest to get an idea about how each variable is related to measures such as the number of borrowers who default and the number who successfully pay back their loans. Once we have the univariate plots, we can have an idea about what variables we may want to explore in a multi-variate fashion.
## [1] 113937 81
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
Since there are 81 variables, it is impossible to examine the relationships between all of them. So I will pick a subset of the variables. My main question is how variables of interest relate to the ability or inability of a borrower to pay back the loan amount plus interest. Because of this, the most important variable I will be looking at is Loan Status. This is a categorical variable with categories such as Completed, Defaulted, and Past Due for varying number of days among others. It would make sense that a person’s employment status will affect their ability to pay back the loans so I will also look at this variable. A borrower’s employment duration probably correlates with their ability to keep their job, which would make it easier to pay back the loans so I will look at this variable as well. I will also look at the income range for borrowers since having a higher income would probably make it easier to pay back loans. A borrower’s credit score is a statistical method to determine the likelihood that they will pay back a loan in a specified amount of time (Reference 4), so I will look at that variable and see how it is related to loan status (specifically the lower range of a person’s credit score at the time the listing was created). It would be expected that a borrower with a low debt to income ratio would have an easier time paying back loans, so this variable will be considered in this analysis.
Two variables which are only for listings post July 2009 are Prosper Score and Prosper Rating. A Prosper Score is a number from 1 to 11 determined by Prosper which represents the probability of a loan going “bad” where bad is defined as the loan going 60 or more days past due within 12 months of the loan origination (1 is the highest risk score and 11 is the lowest risk) (Reference 5). I will see how this variable is related to loan status and also to another variable, the number of days the loan is delinquent (number of days a payment is past due) (Reference 6). A Prosper Rating is a grade assigned to a loan to determine the estimated annual loss to potential investors (AA to HR with AA having the lowest estimated loss and HR having the highest) (Reference 7). It would be expected that in general, loans with the best rating (AA) would have the fewest number of borrowers who default on the loan. One important point to keep in mind during this project is that a Prosper Rating may not be accurate because it is based on a borrower’s credit report which may contain inaccurate information (Reference 7). Another variable that will be examined is whether or not a borrower is a home owner. If a borrower is a home owner, I would expect they will have higher credit worthiness and be less susceptible to going into default. Finally, I will look at Borrower APR. An APR is an annual charged rate which is the cost of borrowing the money (Reference 8). My guess is that the lower this amount is, the easier it will be to pay back the loans. I will look at this variable as well.
In summary the dependent variables for this project are loan status and number of days the current loan is delinquent. The independent variables are employment status, employment duration, income range, credit score range lower, debt to income ratio, prosper score, prosper rating, borrower APR and is borrower a homeowner. Note that for prosper score and prosper rating I will need to subset the data frame to only include loans which actually have a score and rating (those post July 2009). Note also that for income range, I will subset and only consider data for which the income is verifiable (I have included this variable as well in the data I will use) and for employment duration (employment status duration), I will not consider statuses of no listing (a blank status), not available, retired and other. I will set the employment duration of those who are unemployed as 0.

## Completed Current Past Due Defaulted Chargedoff Cancelled
## 38279 56576 2067 5018 11992 5
From this plot, the category with the highest proportion of borrowers is clearly the Current loans. I lumped all the Past Due categories into one Past Due category since I only really care about Completed, Defaulted and Chargedoff loans. I also lumped the FinalPaymentInProgress category into the Completed category since these loans are for all intents and purposes finished. I did some research and Chargedoff loans are Defaulted loans for which there is no expectation of re-payment by the lenders (Reference 9). Because of this, I am equally as interested in Chargedoff loans as I am those in Default. Completed loans clearly have a better proportion than the combination of Defaulted and Chargedoff loans (around 0.37 versus around 0.15). It will be interesting to see how these proportions match up with those of employment status, credit score and the other variables.

## Employed Full-time Not available Not employed
## 2255 67322 26355 5347 835
## Other Part-time Retired Self-employed
## 3806 1088 795 6134
From the data available we can conclude that the information given isn’t very helpful at possibly predicting whether a borrower will be successful in paying back their loans. This is primarily because the categories are vague. The highest proportion of borrowers fit into the “Employment” category. This category is very vague. It doesn’t indicate whether the borrower is a Full-time or Part-time employee (there are actually categories for this). It also doesn’t indicate what their salary is. There are also the categories of Other and a portion of the borrowers who did not list an employment status. If I were to decide to examine this data further in the bivariate and multivariate portions of the analysis, I might get rid of the borrowers fitting in the Other and non-labeled categories. I may construct a plot or two to see if there are relationships, but I think other variables I’ve chosen to focus on will give a better indication of success. I picked this variable just to see if there may actually be an obvious pointer to a relationship but there isn’t (for an obvious pointer to a relationship to exist, the proportions of one category would need to equal that of the Completed and Default/ChargedOff categories). Even if the proportions were equal, the relationship may not be straight-forward between these two variables.

## $0 $1-24,999 $25,000-49,999 $50,000-74,999 $75,000-99,999
## 122 6250 29867 29345 16010
## $100,000+
## 16064
From the above bar chart of proportion of borrowers inside each income range I considered, the largest proportions are for $25,000-49,999 and $50,000-74,999. The proportion was about the same for the higher two categories. Note that I did not consider data which was not verifiable. I also lumped two categories together, borrowers which had an income of $0 and borrowers which had an income of ‘Not employed’. It’s hard to say which of these categories may be associated with higher levels of success or failure in paying back the loans. It would be assumed that the higher a borrower’s income, the easier it will be for them to pay back a loan but this isn’t the only variable affecting success or failure so I will need to further look at this in the bi-variate and multi-variate case.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 26.00 67.00 95.51 137.00 745.00
Based on the second and third graphs, the data is definitely right-skewed. This data represents employment status duration, but because of how I have set the number of months worked for those unemployed to be 0, I am basically considering employment duration. I did this in part because there were outliers for those not employed of multiple years. I cannot see Prosper letting a borrower have a loan who had not been employed for many years (I originally let the employment status duration for those not employed to be negative). There are still outliers in the data here. From the summary, the maximum number of months worked is 745 or around 63 years. This may actually be possible, but if I graph this data against loan status or any other variable in the bi-variate or multi-variate portions of this project, I will cut off this data from consideration. In the next R chunk, I will see what this data looks like for anything above 400 months (33 years). After transforming the data using log10, the data is still left-skewed. Because of this, I may not transform the data in the bi-variate or multi-variate case. My guess is that the larger ones’ employment duration, the more likely they are to pay back loans.
## loan_status employment_status employment_status_duration
## 138 Current Employed 551
## 307 Current Employed 413
## 450 Completed Employed 407
## 487 Completed Employed 456
## 521 Current Self-employed 446
## 532 Completed Full-time 409
## income_range income_verifiable borrower_apr credit_score_range_lower
## 138 $1-24,999 True 0.21648 660
## 307 $1-24,999 True 0.35797 720
## 450 $50,000-74,999 True 0.35797 680
## 487 $75,000-99,999 True 0.22872 780
## 521 $75,000-99,999 False 0.16324 700
## 532 $50,000-74,999 True 0.18526 680
## debt_to_income_ratio prosper_score prosper_rating days_delinquent
## 138 0.44 4 C 0
## 307 0.33 3 HR 0
## 450 0.18 3 HR 0
## 487 0.16 6 C 0
## 521 NA 4 B 0
## 532 0.23 4 B 0
## is_homeowner
## 138 False
## 307 True
## 450 True
## 487 True
## 521 False
## 532 True
After looking at the data for durations above 400 months (and for the specific case of 743 months, not shown), I actually do think this data is possible and that the information wasn’t miss-entered. Because of this, I will keep this data when performing calculations in the bi-variate and multi-variate case.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 660.0 680.0 685.6 720.0 880.0
From the three plots of the histograms, each more finely tuned than the next, I notice a few things. The first is that there are some borrowers with a credit score of zero. I am going to see what the rest of the variables looks like for some of these borrowers to see if I should discard them or if I should keep them (if it was simply an error in entering the data). The next is that the data is pretty close to being normal. From the summary given, the mean is only 5.6 away from the median out of a range of 40 for the IQR. Since the mean is greater than the median however, it is skewed to the right. I don’t think it’s enough to warrant performing a transformation on the data, however. The middle plot was bi-modal but changing the binwidth caused this to disappear from the data. Another thing I noticed after changing the binwidth is that there appears to be only integer values of credit score in increments of 20. There do not appear to be values that aren’t whole or that fall between values of 600, 620, 640 and so on (note that I confirmed this using the table command but didn’t include it in the final code). In terms of lower range of credit score predicting the ability of a borrower to pay back their loans, there is a fairly steep drop off from around 640 to 620 in the number of borrowers. I would be interested to see what percentage of the borrowers to the left of the drop off are in each of the loan status categories. I may look at how this variable changes with each loan status category using a box plot in the bi-variate analysis.
## loan_status employment_status employment_status_duration income_range
## 795 Completed Not available NA <NA>
## 913 Defaulted Not available NA <NA>
## 1686 Chargedoff Not available NA <NA>
## 4291 Chargedoff NA <NA>
## 4372 Defaulted Not available NA <NA>
## 4431 Defaulted Not available NA <NA>
## income_verifiable borrower_apr credit_score_range_lower
## 795 True 0.29776 0
## 913 True 0.27276 0
## 1686 True 0.18065 0
## 4291 True 0.24502 0
## 4372 False 0.24505 0
## 4431 True 0.16296 0
## debt_to_income_ratio prosper_score prosper_rating days_delinquent
## 795 NA NA 0
## 913 0.03000 NA 158
## 1686 NA NA 2271
## 4291 0.10368 NA 2388
## 4372 NA NA 209
## 4431 NA NA 202
## is_homeowner
## 795 False
## 913 False
## 1686 False
## 4291 False
## 4372 False
## 4431 False
Looking at the first 6 rows of data for those with a lower credit score range of 0, I actually think the data is good and that is truly the lower range on the credit score. The employment status and debt to income ratio is unknown for the first six rows of data, so these rows are probably cases where very little information is known about the borrowers so the risk for investors will be high. The APR seems to be high for the first 6 rows of data so that would make sense for borrowers with bad credit scores.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1400 0.2200 0.2759 0.3200 10.0100
A few things stand out about these graphs. The first is that there appears that there might be outliers close to a ratio of 10. Again, I will check these out in the next R chunk to see if I should drop the data. The second is that like the credit score graph, the data is skewed to the right. This initially surprised me but it didn’t after thinking on it some more. I was initially thinking that the higher a borrower’s debt to income ratio is, the less able they will be to pay back their loans. But it actually may be that borrowers with higher incomes are able to take on much more debt and still be successful at paying back the loans (my guess is that there is a lot of overlap between the two graphs). A significant portion of the debt for those with higher debt to income ratios is probably student loan debt. It could very well be that borrowers with low debt to income ratios simply don’t have the ability to take out loans because their income is low and because of this, they are the ones who will be less successful at paying back loans.
## loan_status employment_status employment_status_duration income_range
## 23 Current Other 68 $1-24,999
## 227 Completed Full-time 9 <NA>
## 577 Completed Retired 65 $1-24,999
## 627 Current Employed 51 $1-24,999
## 1243 Completed Not employed 0 $0
## 1926 Defaulted Not employed 0 $0
## income_verifiable borrower_apr credit_score_range_lower
## 23 True 0.35356 700
## 227 False 0.11749 600
## 577 False 0.10505 800
## 627 True 0.35797 700
## 1243 False 0.11415 740
## 1926 False 0.08684 740
## debt_to_income_ratio prosper_score prosper_rating days_delinquent
## 23 9.20 2 HR 0
## 227 10.01 NA 0
## 577 10.01 NA 0
## 627 10.01 4 HR 0
## 1243 10.01 NA 0
## 1926 10.01 NA 296
## is_homeowner
## 23 False
## 227 True
## 577 False
## 627 False
## 1243 False
## 1926 False
After looking at the first few rows for borrowers with a debt to income ratio greater than 5, I now think that this data should be left. One of the reasons borrowers may have an extremely high debt to income ratio is because they have either lost their jobs but have fairly good credit scores or because they are simply between jobs. This would cause an extremely low income which would explain the high debt to income ratios.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229
The first thing to notice about this graph is that from the summary and the first two plots, the data appears to be skewed to the right. This is confusing to me because I was assumming that borrowers with good credit scores would have lower APR’s. I was assuming that this graph would be skewed to the left and be a reflection of the credit score and debt to income ratio histograms. A few things could be responsible for this result. The first is that there may not be a nice relationship between good credit score and borrower apr. Perhaps a borrower has a good credit score but they have lost their job which makes them a risky investment. Or at the other end of the spectrum, perhaps a borrower has a bad credit history but has landed a good job such that prosper assigns a lower than expected interest rate. And it is not even certain that there is a nice relationship between credit score and debt to income ratio as I guessed earlier. I will have to graph these variables against each other to determine a discernible relationship. Another observation pointing to the possibility that there may not be a nice relationship between borrower apr and credit score/debt to income ratio is that the mode on the borrower apr is significantly different from the mean and median. From the third plot with a fine binwidth, the mode appears to be around 0.36 while the median and mean is around 0.21. One final observation is that there are borrowers with apr’s greater than 0.4. I’m going to assume that these data points are accurate and are not outliers which have been incorrectly entered since I accepted the data I thought were outliers for debt to income ratio and credit score.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 10.00 27.00 41.78 72.00 129.00
For the days_delinquent variable, I only considered loans that had not been charged off or defaulted since they are in a separate category and will have large values. I also only considered cases for which there was a non-zero value for days_delinquent. Had I considered all the borrowers, a disproportionally large number of them would have been zero days’ delinquent and I wouldn’t be able to gain any insights on borrowers who hadn’t yet defaulted or been charged off on their loans who also are delinquent in paying their loans. From these distributions, the data is definitely skewed to the right with the median at 27 days. Most of the cases are below 50 days. It’s hard to know which variables may predict the number of days’ delinquent since thus far I have considered all the data and not subsetted the data as I did in this R chunk. It may also be good in the future to consider the ratio of days’ delinquent to number of days since the loan originated. This might be a better indicator of the relative success of borrowers paying back their loans. A borrower may be 30 days’ delinquent but only 2 months into paying back the loans while another borrower may be 30 days’ delinquent but be halfway through paying back the loans (a matter of years). The delinquency for the second borrower isn’t going to be viewed in the same way as that of the first. This variable is a secondary independent variable to that of loan status so I’m not as concerned at gaining insights into what may predict these numbers.

## False True
## 56459 57478
From the bar chart above showing proportions of borrowers who own a home, we can see that there is minimal difference between the conditions. The proportion of borrowers who own a home is slightly greater than 50%. It’s hard to determine how well this variable predicts loan payback success rate without using bi-variate and multi-variate graphic capabilities in R. I will facet loan status over each of these conditions to see which condition has a higher level of success and failure. It would make sense that borrowers owning a home will be more likely to payback their loans than those that do not own a home.

## 1-3 4 5 6 7 8 9-11
## 14400 12595 9813 12278 10597 12053 13117
Based on the bar chart shown, the combinations of prosper scores 1-3 have the largest proportion of borrowers while the combinations of prosper scores 9-11 have the second largest proportion of borrowers. I decided to combine the totals for 1-3 and 9-11 into one category so that I wouldn’t be considering 11 different categories which seemed to be too much. It should be noted that scores of 4 and 6 would have had the highest proportions if I hadn’t done this. This would be reasonable since most users will have around an average risk associated with the investment (prosper scores of 1 and 11 would have had the lowest and second lowest proportions, respectively). In terms of prosper scores predicting whether a loan will be defaulted, chargedoff or completed, it is hard to tell. It is worth mentioning that around the same proportion of borrowers assigned a rating between 9-11 also either defaulted or were chargedoff with their loans based on the loan status graph previously, but that data considered all loans and not just those after July 2009 (the bar chart above is only for data post July-2009). I will need to use R’s bi-variate graphing capabilities to see if a relationship exists, I suspect it will.

## HR E D C B A AA
## 6935 9795 14274 18345 15581 14551 5372
From the above bar chart, it appears that the prosper rating with the highest proportion of borrowers is that with a rating of C. As with the prosper scores, this is reasonable since this is the average prosper rating. Most borrowers will not have excellent or bad credit but have average credit. The two ratings with the lowest and second lowest proportion of borrowers are the AA and HR ratings, respectively. As I mentioned with the prosper score, I will need to use R’s bi-variate graphing capabilities to determine the relationship between prosper score and loan status.

## Completed Current Past Due Defaulted Chargedoff Cancelled
## 1500 2784 22 13 48 0
## Completed Current Past Due Defaulted Chargedoff Cancelled
## 914 1452 127 91 505 0
The first plot in the visual directly above is the loan status for the best possible combination of prosper rating and score (AA and 9-11) while the second plot is the loan status for the worst possible combination of prosper rating and score (HR and 1-3). Clearly there is a higher precedence of loans being past due, defaulted and charged off for the worst combination than the best combination. There is also a higher percentage of loans which are completed in the best combination versus the worst combination. This would seem to point to a relationship between these three variables. I will be examining this further in the bi-variate and multi-variate portion of my analysis.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 12.25 42.50 47.77 73.50 118.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 10.00 25.00 42.17 72.00 121.00
The above two histograms are the distributions of days’ delinquent under the best (AA and 9-11) and worst (HR and 1-3) possible combination of prosper scores and ratings as I did with the loan status variable. As I did in the original plots of days’ delinquent, I limited the data to cases where the days’ delinquent was greater than 0 and to cases where the loan had not been charged off or defaulted. Clearly there are fewer cases of days’ delinquent for the
best combination than the worst combination. Looking at the summary however, the average number of days’ delinquent is greater for the best case scenario than the worst case scenario. A couple of things should be kept in mind. The first is that it may be better to create a variable where the number of days’ delinquent is divided by the number of days since the origination of the loan.
As I mentioned before, someone may be thirty days’ delinquent on a loan but be really close to finishing paying off the loan. The second is that the number of borrowers’ delinquent being greater in the worst combination case may simply be because Prosper assigned more borrowers this combination than they did borrowers the best combination. It could be that prospers ratings and scores are good at predicting loan status but not good at predicting days’ delinquent. Again, I’m not as concerned about predicting this variable but it is something to look at in future analyses.
Summary Observations about Univariate Analysis
This concludes the univariate portion of the analysis. The following are some summaries about what I have observed so far.
Structure of the dataset and summary of results
My dataset had 81 variables with 113,937 observations. I only decided to consider 11 total variables including loan_status, employment_status, income_range, employment_status_duration, credit_score_range_lower, borrower_apr debt_to_income_ratio, prosper_score, prosper_rating, days_delinquent and is_homeowner.
The variables loan_status, employment_status, income_range, prosper_score, prosper_rating and is_homeowner are all factor variables (note that I turned prosper_score into a factor variable from a number variable) with the following levels:
loan_status: Defaulted, Chargedoff, Cancelled, Past Due, Current, Completed
employment_status: , Employed, Full-time, Not available, Other, Part-time, Retired, Self-employed
income_range: $0, $1-24999, $25000-49999, $50000-74999, $75000-99999, $100000+
prosper_score: 1-3, 4, 5, 6, 7, 8, 9-11
prosper_rating: AA, A, B, C, D, E, HR
is_homeowner: True, False
Most of the borrowers are in a Current status with their loans while around 15% of total borrowers have either Defaulted or been charged off. The employment status variable is very vague. Most of the borrowers have a status of employed but it is not clear what this means (full-time or part-time?). Around 30% of borrowers have an income in the ranges of $25000-49999 and $50000-74999 (note that these proportions are only for data for which the income is verifiable and known). Most of the borrowers are either in the 1-3 or 9-11 range of prosper scores but if all the scores are considered separately, most of the borrowers have either a 4 or 6 rating. Most of the borrowers have a loan status of C. And around an equal number of borrowers are homeowners. The combination of prosper score and prosper rating appeared to have an effect on loan status.
The variables employment_status_duration, credit_score_range_lower, debt_to_income_ratio and borrower_apr are all right-skewed and I think there is some overlap between them. I think they are related to each other and I will be examining this question in the bi-variate analysis.
Main features of the Prosper loan data
The main features of interest in the selected data are loan status, prosper score and prosper rating. Prosper score is a variable that Prosper assigns to loans to determine the risk they pose to investors (Reference 5). This score takes into account a number of variables such as debt to income ratio, loan payment performance on prior loans, credit card utilization, number of delinquent accounts and others (Reference 5). Prosper rating is another variable Prosper uses which is based on the Prosper score and credit score (Reference 7). Both these are variables Prosper uses to uniquely judge the risk of a potential loan, so I want to see how well they do at associating themselves with borrowers who Default, are charged off or who complete the loan. I have already done this somewhat in the second to last graph of the univariate analysis. I looked at the loan status bar chart for the best and worst possible combinations of prosper score and rating. It was clear that the worst possible combination had higher rates of Defaulted and Chargedoff loans and a lesser rate of completed loans relative to the best combination.
Supporting features of Prosper loan data that will aid in investigation of main features.
I think the question of whether a borrower is a homeowner will aid in my investigation. I want to see which condition (owning or not owning a home) is associated with higher success and failure rates for paying back the loans. If a borrower owns a home, that means a bank has already approved them for a loan after examining their credit history. This would be a good sign that the borrower should be trusted. I also think borrower apr, credit score, debt- to-income ratio and employment status duration will be beneficial to my investigation. I think they are all related to each other and I will try to eliminate at least one of them in the bi-variate analysis portion of the investigation. I wanted to see if employment status would support my investigation but the categories seem rather vague and are probably not useful at predicting loan payback success or failure. I will also look at how income range affects loan status. I predict that higher incomes will be associated with greater loan payback success.
Unusual observations from the Univariate Analysis and ways I adjusted the data
There were three main results from the Univariate analysis that I thought were unusual. The first was the shape of the debt-to-income ratio histogram. I initially thought that the less that number was, the more successful borrowers would be at paying back loans. But I think that there is a lot of overlap between the shape of the credit score histogram and that of the debt-to-income ratio. I have not considered these variables together, but my guess is that the reason an individual may have a higher than average debt-to-income ratio is because they have a higher salary and can take on more debt. Even though the the ratio may be higher, they will have more funds to pay back loans and will be more successful at paying back the loans. I need to confirm this but that is my guess. The shape of this curve could also be due to borrowers having a large student loan. This would cause the debt-to-income ratio to be higher than average but if borrowers are in a highly paying field (law or medicine), they will still be able to successufully pay back the loans.
The second result that I thought was unusual was the shape of the borrower apr histogram. This too was skewed to the right, but if my guess that the credit score and debt-to-income ratio histograms overlap is correct, this curve should have been skewed to the left. The borrowers with higher credit scores should qualify for lower interest rates (Reference 10). Again, I need to graph these variables against each other to truly know what’s going on.
The third result I thought was unusual was when I compared the distribution of days’ delinquent between the best combination of prosper score and rating and the worst combination of prosper score and rating. I thought the average days’ delinquent would be greater for the worst combination than the best combination but the relationship was reversed. I think to truly analyze this variable, it needs to be divided by number of days since loan originated. As I said in the analysis, a borrower who is 30 days delinquent but 2 years into paying back the loan will be viewed differently than a borrower who is 20 days delinquent and only 3 months into paying back the loan.
I combined some of the categories in loan status, income range and prosper score. I also only found days’ delinquent for cases where the borrower hadn’t defaulted or been charged off and for cases where the days’ delinquent was greater than 0.
I combined all the Past Due categories to Past Due in the loan status variable because I’m mainly only interested in the Completed, Chargedoff and Defaulted categories for this variable. I combined these categories to lessen the clutter of the bar chart.
I combined prosper scores 1-3 into one category and prosper scores 9-11 into one category. I did this because I thought 11 categories seemed too much and I don’t need all of them to gain insight into how prosper score may determine loan status.
I only found the days’ delinquent histogram for borrowers who had greater than 0 days’ delinquency because if I hadn’t, the bin at 0 would have dominated the rest of the histogram and I couldn’t have made an observations about the data. I also only considered cases where the borrower hadn’t defaulted or been charged off because these cases will be irrelevant to investors as they consider days’ delinquent. They have already reached a final status and I am predicting these statuses elsewhere.
For income range, I only considered incomes which could be verified because that’s the income that could be trusted. I also combined the 0$ and Not employed categories since they mean the same thing.
For employment duration, I let borrowers who were not employed have a duration of 0 because the original variable was employment status duration and I wanted to consider how long a borrower had been emloyed. The same logic holds for why I didn’t consider borrowers who had retired. I also did not consider employment statuses of blank, not available and other.
For all the variables, wherever NA’s were present, I filtered them out of the data.
A note about the days’ delinquent variable
This analysis will not consider the days’ delinquent variable going forward. This variable was an extra independent variable that I wanted to consider but upon looking at the histograms for the best and worst possible combinations of prosper score and rating, I have decided that more variables would need to be added and that this variable would be worth it’s own analysis at some point in the future. The only independent variable I will consider is loan status for the remainder of the analysis.
Bivariate Plots Section
First, I will look at some matrices of variables plotted against each other to see if I can find any relationships between the variables. I will see if any of the independent, non-categorical variables are correlated with themselves and if there is a pattern for each of them with the loan status variable. I will also see if there is a relationship between loan status and the categorical variables. Finally, I will see if there is a relationship between loan status and prosper score/rating. These matrices will give me an idea about what to expect as I make the bivariate plots in this section.

From the matrix given, there does not appear to be any strong linear correlations between the variables. There are no monotonic relationships between the variables (this is confirmed with the pearson correlation coefficients given) (Reference 11). I do however, want to take a closer look at debt_to_income ratio versus credit_score_range_lower.
In terms of the relationship between loan status and the independent variables, there appears to be an expected relationship between loan status and borrower apr/lower range of credit score. Completed and current loans appear to have the lowest median borrower apr even though there is a larger spread for completed loans from the box plots. Past due has the highest median apr with the two worst possible cases, defaulted and chargedoff, possessing the third and second highest median apr, respectively. Higher credit score was also associated with loan payback success (either completed or current) while lower credit score was associated with failure (either defaulted or charged off). It will be interesting to look at these variables and the others further in the bi-variate analysis.

From the above matrix of variables plotted against each other, it is difficult to determine patterns for loan_status against each of the independent variables because these are basically faceted bar charts. I will need to look at these relationships individually.

From the above scatterplot matrix, for loan status, prosper rating and prosper score, there are some expected and unexpected results. For a loan status of completed, prosper score category of 9-11 has the highest percentage of borrowers (to be expected as these are the lowest risk loans). A loan status of defaulted has about equal counts across all prosper scores but this is difficult to see in the matrix and will need to be further examined in the bi-variate analysis. A prosper score of 1-3, 5 and 6 has highest proportions of borrowers with a loan status of chargedoff.
A surprising result was that a prosper rating of D had the highest proportion of borrowers with a completed loan. The three worst prosper ratings (HR, E, and D) had the highest proportion of borrowers who had been chargedoff. As with prosper score, prosper ratings effect on borrowers who had defaulted was difficult to see.

## [1] -0.01316852
From the above grid of graphs and the pearson correlation coefficient given, it is easy to see that there is no correlation between the lower range of credit score and debt_to_income_ratio. The data is certainly not monotonic. It seems from the last three graphs that the data is almost parabolic (except for the two points closer to the lower end of credit score). For the second graph, I decided to cut off the data at debt_to_income_ratio of 10 and credit_score_range_lower of 0. The last two graphs represent the conditional means and medians of debt_to_income_ratio at each of the credit scores since the credit scores seem to be in discrete values. I left the debt_to_income_ ratio values at 10 and the credit_score_range_lower values of 0 in these calculations because even though they are outliers, I still think those pieces of data are valid based off the univariate analysis. Since this data is not monotonic, I will need to consider both variables as I proceed through the bi- variate analysis.
From the above grid of scatterplots and line graphs, it appears that there are three regions. The first for which credit scores are relatively bad but debt-to-income ratios are low. The second for which debt-to-income ratios have a peak mean and median but with average credit scores. And the third for which debt to income ratios are also low but credit scores are very good. I am guessing that the two regions on the end are associated with the lowest and highest income ranges possible. I am now going to look at a box plot of credit score versus income range to see if my guess is correct.

## subset(prosp_income, !is.na(credit_score_range_lower))$income_range: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 520.0 605.0 660.0 663.9 720.0 820.0
## --------------------------------------------------------
## subset(prosp_income, !is.na(credit_score_range_lower))$income_range: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 520.0 640.0 680.0 668.8 700.0 880.0
## --------------------------------------------------------
## subset(prosp_income, !is.na(credit_score_range_lower))$income_range: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 520.0 640.0 680.0 678.6 720.0 880.0
## --------------------------------------------------------
## subset(prosp_income, !is.na(credit_score_range_lower))$income_range: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 520.0 660.0 680.0 690.9 720.0 880.0
## --------------------------------------------------------
## subset(prosp_income, !is.na(credit_score_range_lower))$income_range: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 520.0 660.0 700.0 699.9 740.0 880.0
## --------------------------------------------------------
## subset(prosp_income, !is.na(credit_score_range_lower))$income_range: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 520.0 680.0 700.0 710.5 740.0 880.0
From the above box plot and summaries, we have what is to be expected. The median credit score for borrowers increases as income range increases. The means (red x) also increase as income range increases. Because of this, the portion of the debt_to_income_range versus credit score graph in the far left is probably for borrowers with an income range of $0 and $1-24,999 while borrowers with an income range at or greater than $100,000 probably dominate the right side of the scatter plots. Income reduction is a big reason borrowers go bankrupt so this box plot does not surprise me (Reference 12).
Now I will look at how loan status is related to each of the non-categorical variables by looking at box plots of the data. First I will consider employment duration.

## subset(prosp_edur, !is.na(employment_status_duration))$loan_status: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 21.00 52.00 80.62 112.00 745.00
## --------------------------------------------------------
## subset(prosp_edur, !is.na(employment_status_duration))$loan_status: Current
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 33.0 81.0 108.9 157.0 649.0
## --------------------------------------------------------
## subset(prosp_edur, !is.na(employment_status_duration))$loan_status: Past Due
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 28.00 69.00 96.12 136.50 527.00
## --------------------------------------------------------
## subset(prosp_edur, !is.na(employment_status_duration))$loan_status: Defaulted
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 18.0 51.0 78.1 111.0 554.0
## --------------------------------------------------------
## subset(prosp_edur, !is.na(employment_status_duration))$loan_status: Chargedoff
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 18.00 50.00 78.54 110.00 623.00
## --------------------------------------------------------
## subset(prosp_edur, !is.na(employment_status_duration))$loan_status: Cancelled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30 30 30 30 30 30
Generally based on the above box plots and summary I get results I expected. The lowest employment durations (mean and median) are for Defaulted and Chargedoff borrowers. Defaulted has the lowest mean employment duration while Chargedoff has the lowest median employment duration. This relationship isn’t too strong though as the difference in median employment duration between Chargedoff and Completed borrowers is only 2 days. The difference in means is even less. A statistical test for significance would need to be conducted to see if these differences are significant (my guess is that they are not). In contrast, the difference in lower range of credit score between borrowers with at or greater than $100,000 and those with income range of $1-24,999 is 41.7. But a statistical test for significance is still needed.
Now get box plots of credit score lower range versus loan status.

## subset(prosp_sub, !is.na(credit_score_range_lower))$loan_status: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 640.0 680.0 685.7 740.0 880.0
## --------------------------------------------------------
## subset(prosp_sub, !is.na(credit_score_range_lower))$loan_status: Current
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 700.0 698.7 720.0 880.0
## --------------------------------------------------------
## subset(prosp_sub, !is.na(credit_score_range_lower))$loan_status: Past Due
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 680.0 688.1 720.0 860.0
## --------------------------------------------------------
## subset(prosp_sub, !is.na(credit_score_range_lower))$loan_status: Defaulted
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 560.0 640.0 620.9 680.0 860.0
## --------------------------------------------------------
## subset(prosp_sub, !is.na(credit_score_range_lower))$loan_status: Chargedoff
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 600.0 660.0 648.9 700.0 860.0
## --------------------------------------------------------
## subset(prosp_sub, !is.na(credit_score_range_lower))$loan_status: Cancelled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 500 515 580 595 660 720
In general, this box plot and summaries went as expected. Not including cancelled loans, loans of Defaulted had the lowest mean and median lower range of credit score. Loans of Chargedoff had the second lowest mean and median lower range of credit score (again, not including cancelled loans). Loans of Current had the highest and loans of Completed had the second highest median (tied with Past Due). Past Due did have the second highest mean credit score but Completed still had the third highest and was greater than the Defaulted and Chargedoff mean credit scores. This relationship seems to be stronger than employment duration but as with all the differences in means, a statistical test for significance is needed.
Now get box plots of debt to income ratio versus loan status.

## subset(prosp_sub, !is.na(debt_to_income_ratio))$loan_status: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.120 0.190 0.264 0.290 10.010
## --------------------------------------------------------
## subset(prosp_sub, !is.na(debt_to_income_ratio))$loan_status: Current
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.160 0.230 0.262 0.320 10.010
## --------------------------------------------------------
## subset(prosp_sub, !is.na(debt_to_income_ratio))$loan_status: Past Due
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1600 0.2300 0.2934 0.3500 10.0100
## --------------------------------------------------------
## subset(prosp_sub, !is.na(debt_to_income_ratio))$loan_status: Defaulted
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1400 0.2200 0.3693 0.3500 10.0100
## --------------------------------------------------------
## subset(prosp_sub, !is.na(debt_to_income_ratio))$loan_status: Chargedoff
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1300 0.2100 0.3392 0.3300 10.0100
## --------------------------------------------------------
## subset(prosp_sub, !is.na(debt_to_income_ratio))$loan_status: Cancelled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01504 0.10361 0.15248 0.16487 0.21321 0.34000
With the debt_to_income_ratio versus loan_status box plots, there were some expected and surprising results. The means of debt_to_income_ratio went as I expected with the highest mean ratio being for Defaulted loans and the second highest mean ratio being for Chargedoff loans. I had originally thought this might be the case before looking at the credit score and debt to income ratio histograms in the univariate case and now with the box plots my thoughts are confirmed at least by the means (again, a significance test still needs to be performed). The lowest mean is for Current loans and the second lowest is for Completed. The medians also follow a similar pattern except that the median ratio for Defaulted and Chargedoff is less than Current. Completed loans do have the lowest median debt to income ratio for loan statues outside those that are Cancelled. An individual with a low debt_to_income_ratio should have an easier time paying back loans and the data mostly points to this.
Now I will look at how borrower apr is affected by loan status.

## subset(prosp_sub, !is.na(borrower_apr))$loan_status: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00653 0.13271 0.19501 0.20889 0.28525 0.51229
## --------------------------------------------------------
## subset(prosp_sub, !is.na(borrower_apr))$loan_status: Current
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06106 0.15833 0.20524 0.21374 0.26528 0.35838
## --------------------------------------------------------
## subset(prosp_sub, !is.na(borrower_apr))$loan_status: Past Due
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06327 0.21827 0.27285 0.26738 0.32576 0.38486
## --------------------------------------------------------
## subset(prosp_sub, !is.na(borrower_apr))$loan_status: Defaulted
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00864 0.17722 0.24001 0.23893 0.29776 0.50633
## --------------------------------------------------------
## subset(prosp_sub, !is.na(borrower_apr))$loan_status: Chargedoff
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01823 0.19003 0.26271 0.25775 0.32958 0.46201
## --------------------------------------------------------
## subset(prosp_sub, !is.na(borrower_apr))$loan_status: Cancelled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1466 0.1622 0.2074 0.2058 0.2565 0.2565
The results of the summary and box plots above follow what I expected. For both the mean and median, the borrower apr increases as the loan status gets dire. Borrowers with a higher apr should have a harder time paying back loans so it is to be expected that they will have a higher rate of defaulting and being chargedoff. The one exception was that a status of Past Due had the largest mean and median borrower apr but several of these borrowers may be close to defaulting. And they should have a higher borrower apr on average than those who are current or completed.
Next, I will look at how loan status is affected by the categorical variables I have considered (is_homeowner, employment_status, income range, prosper score, and prosper rating).

## prosp_edur$loan_status: Completed
## Employed Full-time Not employed Part-time Self-employed
## 12509 17407 344 794 1793
## --------------------------------------------------------
## prosp_edur$loan_status: Current
## Employed Full-time Not employed Part-time Self-employed
## 48929 1316 248 24 3085
## --------------------------------------------------------
## prosp_edur$loan_status: Past Due
## Employed Full-time Not employed Part-time Self-employed
## 1725 71 26 8 113
## --------------------------------------------------------
## prosp_edur$loan_status: Defaulted
## Employed Full-time Not employed Part-time Self-employed
## 630 2217 27 60 246
## --------------------------------------------------------
## prosp_edur$loan_status: Chargedoff
## Employed Full-time Not employed Part-time Self-employed
## 3529 5343 190 202 897
## --------------------------------------------------------
## prosp_edur$loan_status: Cancelled
## Employed Full-time Not employed Part-time Self-employed
## 0 1 0 0 0
From the above bar chart and summaries there is the expected result that most of the borrowers with loan status of Completed are Full-time. The overwhelming majority of borrowers with a Current status are Employed but as I’ve mentioned before, this definition is very vague. Most of the Defaulted and Chargedoff loan statuses have borrowers who are Full-time. Based on the research I’ve done, employment status isn’t so much a predictor of loan payback success as income range is. The one conclusion I think can be gathered from this is that it is good to have a full-time job to pay back a loan, but from the results this is obviously not always true and I think income range will be a better predictor (Reference 12).
Next, I will look at bar chart of loan status filled by income range to see which income ranges are most associated with each loan status.

## prosp_income$loan_status: Completed
## $0 $1-24,999 $25,000-49,999 $50,000-74,999 $75,000-99,999
## 88 2504 10203 8906 4729
## $100,000+
## 4537
## --------------------------------------------------------
## prosp_income$loan_status: Current
## $0 $1-24,999 $25,000-49,999 $50,000-74,999 $75,000-99,999
## 2 2199 13947 16619 9563
## $100,000+
## 10104
## --------------------------------------------------------
## prosp_income$loan_status: Past Due
## $0 $1-24,999 $25,000-49,999 $50,000-74,999 $75,000-99,999
## 0 133 623 536 288
## $100,000+
## 277
## --------------------------------------------------------
## prosp_income$loan_status: Defaulted
## $0 $1-24,999 $25,000-49,999 $50,000-74,999 $75,000-99,999
## 6 286 1234 845 361
## $100,000+
## 301
## --------------------------------------------------------
## prosp_income$loan_status: Chargedoff
## $0 $1-24,999 $25,000-49,999 $50,000-74,999 $75,000-99,999
## 26 1128 3859 2439 1069
## $100,000+
## 845
## --------------------------------------------------------
## prosp_income$loan_status: Cancelled
## $0 $1-24,999 $25,000-49,999 $50,000-74,999 $75,000-99,999
## 0 0 1 0 0
## $100,000+
## 0
From the above bar chart, income ranges of greater than $75,000 make up a greater percentage of the bars for loan statues of Completed and Current than for loan statuses of Defaulted or Chargedoff. Income ranges of $0-49,999 also make up a greater percentage of all borrowers for Chargedoff and Defaulted loans than for Current and Completed loans. This should all be expected as having more money will make it easier to pay back loans. It has been shown that income range plays a large part in the ability of borrowers to pay back loans with more income being associated with a better ability to pay back loans (Reference 12) so this is not surprising. Note that I am only considering income that is verifiable and has been reported.
Now I will see the affect being a homeowner has on loan status by filling the loan status bar charts with is_homeowner proportions.

## prosp_sub$loan_status: Completed
## False True
## 19889 18390
## --------------------------------------------------------
## prosp_sub$loan_status: Current
## False True
## 26098 30478
## --------------------------------------------------------
## prosp_sub$loan_status: Past Due
## False True
## 1064 1003
## --------------------------------------------------------
## prosp_sub$loan_status: Defaulted
## False True
## 2744 2274
## --------------------------------------------------------
## prosp_sub$loan_status: Chargedoff
## False True
## 6661 5331
## --------------------------------------------------------
## prosp_sub$loan_status: Cancelled
## False True
## 3 2
From the bar chart, fill and summary, there were some expected but also some surprising results. I am surprised that a greater majority of borrowers with a Completed status do not own a home. Owning a home should point to a good credit history. It’s possible that several of the borrowers successfully paying back a loan who are not homeowners are young borrowers with good credit history who simply haven’t tried to purchase a home yet. Something I would be interested to see is how this changes with loan amount (a variable I have not considered in this analysis). The majority of those under a Current loan status are home owners which would be expected. The majority of loans under the Defaulted and Chargedoff statuses have borrowers which do not own a home as I expected. Also, the relative percentages in favor of those who do not own a home is greater for those under the Defaulted and Chargedoff statues than those under the Completed status.
The next two bar charts will look at how prosper score and prosper rating affect loan status by looking at a bar chart of loan status and seeing what proportion of each bar is given by each prosper score and rating.

## prosp_2009$loan_status: Completed
## 1-3 4 5 6 7 8 9-11
## 2311 1856 2237 2872 2373 3750 4470
## --------------------------------------------------------
## prosp_2009$loan_status: Current
## 1-3 4 5 6 7 8 9-11
## 10337 9476 6222 7981 7210 7265 8085
## --------------------------------------------------------
## prosp_2009$loan_status: Past Due
## 1-3 4 5 6 7 8 9-11
## 544 360 282 327 217 233 104
## --------------------------------------------------------
## prosp_2009$loan_status: Defaulted
## 1-3 4 5 6 7 8 9-11
## 182 119 165 173 135 139 92
## --------------------------------------------------------
## prosp_2009$loan_status: Chargedoff
## 1-3 4 5 6 7 8 9-11
## 1026 784 907 925 662 666 366
## --------------------------------------------------------
## prosp_2009$loan_status: Cancelled
## NULL
From the bar chart and summaries given, Completed has the greatest proportion of loans assigned a prosper score of 7-11 relative to Current, Defaulted and Chargedoff loans. This is not surprising as the highest prosper scores carry the lowest risk and should have the greatest success rate for paying back the loans. Completed had the lowest proportion of loans assigned prosper scores of 1-5 while Chargedoff had the highest proportion of loans assigned scores of 1-5 among these four groups. All this is to be expected.

## prosp_2009$loan_status: Completed
## HR E D C B A AA
## 2322 2538 4220 3015 2824 3237 1713
## --------------------------------------------------------
## prosp_2009$loan_status: Current
## HR E D C B A AA
## 2900 5558 7920 14001 11891 10755 3551
## --------------------------------------------------------
## prosp_2009$loan_status: Past Due
## HR E D C B A AA
## 289 375 457 489 278 154 25
## --------------------------------------------------------
## prosp_2009$loan_status: Defaulted
## HR E D C B A AA
## 209 193 282 134 88 79 20
## --------------------------------------------------------
## prosp_2009$loan_status: Chargedoff
## HR E D C B A AA
## 1215 1131 1395 706 500 326 63
## --------------------------------------------------------
## prosp_2009$loan_status: Cancelled
## NULL
The results from the bar chart and summary go mostly as expected. Within a given loan status, the largest proportion of loans that are rated AA are for the Completed status. The lowest proportion of loans that are rated AA are for the Chargedoff status (the proportions go in order from best result to worst, Completed, Current, Defaulted, Chargedoff). Within a given loan status, the largest proportion of loans that are rated HR are for the Chargedoff status. The lowest and second lowest proportion of loans that are rated HR are for the Current and Completed loan status, respectively. Considering ratings AA-B together, within a given loan status, the largest and second largest proportion of loans that are rated AA-B are for the Current and Completed status, respectively. These proportions are both greater than that of Defaulted and Chargedoff. Considering ratings HR-D together, Defaulted and Chargedoff both have higher proportions of loans assigned these ratings than that of Completed and Current. The relationship seems strong because 68 and 70 percent of loans fall into these ratings for Defaulted and Chargedoff versus 46 and 29 percent for Completed and Current, respectively. All these results are as expected since the riskiest loans (HR) should have a higher probability of resulting in Defaulted/Chargedoff statuses while the safest loans (AA) should result in a smaller probability of resulting in Defaulted/Chargedoff statuses. One thing to point out is that there was a higher percentage of Completed loans with a HR through D rating than with a AA-B rating. This is because Prosper simply assigns more ratings of HR-D than AA-B to loans. They would rather be wrong about a likely bad loan actually being good than a good loan being bad, so they assign very few AA loan ratings. This doesn’t change the fact that loans of HR-D have a higher probability of going bad than loans of AA-B.
Summary Observations about Bivariate Analysis
How features of interest varied with other features in this dataset.
The median employment duration increased from a loan status of Completed to Current but then decreased from Current to Chargedoff. There was minimal difference (1 month) between mean employment duration for Completed and Defaulted loans and a difference of 2 months between Completed and Chargedoff loans. The difference was also around 2 months for the means. Because of the minimal difference in employee duration, I will not consider this variable in the multi-variate analysis.
I expected mean and median credit score range lower to be smaller for the Chargedoff and Defaulted loans than for the Completed loans and that’s what I found. The median increased slightly from Completed to Current loans but then decreased from Current loans to Defaulted loans before increasing from Defaulted to Chargedoff. Median credit score for Completed loans was 40 points higher than for Defaulted loans and 20 points higher than for Chargedoff loans. Mean credit score followed the same pattern as median credit score.
I also expected mean and median debt-to-income ratio to be greater for Defaulted and Chargedoff loans than for Completed loans since having a smaller debt-to-income ratio should make it easier to pay back loans. Both the means and medians were greater for Defaulted and Chargedoff loans than for Completed loans. Current did have a higher median ratio than Defaulted or Chargedoff but the means for Defaulted and Chargedoff were greater than for Current.
Excluding past due, there was a consistent increase in the mean and median borrower apr from Completed to Current to Defaulted to Chargedoff loans. Past due had the highest mean and median borrower apr but several of those loans could be headed for Default or Chargedoff statuses. A higher borrower apr will make it more difficult to pay back loans which should result in more Defaulted and Chargedoff loan statuses. These results were expected.
Most of the loans with a Completed, Chargedoff and Defaulted status were Full-time. The overwhelming majority of loans with a Current status were Employed. The definition of Employed is very vague so I will not consider this variable in the multi-variate analysis. I think income range is more important than employment status.
The majority of borrowers in the Completed and Current loan status categories made greater than $50,000. The majority of borrowers in the Defaulted and Chargedoff categories made less than $50,000. Both these results are not surprising as having more income will make it easier to pay off debt.
For most of the loan status categories, there were around an equal number of borrowers who were homeowners as those who were not homeowners. The largest difference in percent was for the Chargedoff status where 56% of borrowers were not homeowners. There were actually more non-homeowners in the Completed loan status category than homeowners which surprised me. Owning a home means that a bank has already approved a buyer for a mortgage so they should be more credit-worthy than non-homeowners. Because of the small differences in each category, I will not consider this variable in the multi-variate analysis.
For prosper score, more loans with a status of Complete fall into a prosper score of 7-11 than a prosper score of 1-5. For statuses of Defaulted and Chargedoff, more loans fall into a prosper score of 1-5 than 7-11. This is to be expected since a low prosper score (1,2,3, etc…) is associated with higher risk of Defaulting and being charged off.
Current and Completed loans both have higher proportions assigned a rating of AA-B than Defaulted or Chargedoff. Defaulted and Chargedoff loans both have higher proportions assigned a rating of HR-D than Completed or Current. The relationship seems strong because 68 and 70 percent of loans fall into these ratings for Defaulted and Chargedoff versus 46 and 29 percent for Completed and Current, respectively. All these results are as expected since the riskiest loans (HR) should have a higher probability of resulting in Defaulted/Chargedoff statuses while the safest loans (AA) should result in a smaller probability of resulting in Defaulted/Chargedoff statuses. There was a higher percentage of Completed loans with a HR through D rating than with a AA-B rating. This is because Prosper assigns more ratings of HR-D than AA-B to loans.
Interesting relationships between the other features excluding the main features in this dataset.
The first relationship I explored which did not include any of the main features of interest from the dataset was the debt-to-income-ratio versus credit score range lower scatterplots. I originally thought there would be overlap between these two variables such that perhaps they increased monotonically with each other, but instead I found a relationship that excluding the outliers, almost looks parabolic (pointing down). There are two regions to the far left and right of this graph for which the debt-to-income-ratio is relatively low but the credit score is low on the left end and high on the right end. My guess is that on the low end, one will find several loans which have borrowers making less than $50,000 a year and on the high end of credit score, one will find several loans which have borrowers making more than $50,000 a year. I want to look at this in the multi-variate analysis (this guess seems to be supported by the credit score versus income range box plot but I did not find a debt-to-income-ratio versus income range box plot). The data appears to peak in the middle around 0.5 debt-to-income-ratio.
The second relationship I explored which excluded the main features of interest was the credit score range lower versus income range box plot. The medians and means of credit score range lower appeared to increase pretty consistently from an income range of 0$ through an income range of greater than $100,000. In the multi-variate analysis, I plan to have the same graph of debt-to-income ratio versus credit score range lower scatterplot but have the points be filled by income range to see what this looks like.
The strongest relationship found.
First of all, the weakest relationships were for employment status duration versus loan status and is homeowner versus loan status. I will not be considering these relationships in the multi-variate case. I will also not be considering employment status in the multi-variate case because the definitions of the categories are vague.
For the numerical variables, the strongest relationship I found was for the borrower apr. This variable increased consistently from a status of Completed to a status of Chargedoff (excluding Past Due).
For the categorical variables, the strongest relationship I found was for loan status filled by income range bar chart. For each of the best case scenarios (Completed and Current), borrowers making greater than $50,000 had greater proportions. For each of the worst case scenarios (Defaulted and Chargedoff), borrowers making less than $50,000 had greater proportions. I do plan on examining this variable extensively in the multi-variate case.
Multivariate Plots Section
First I will look at the scatterplot of debt-to-income ratio versus lower range of credit score filled by income range so that I can see which incomes are associated with each region of the scatterplot.

From the top chart in the visual above, the middle, left and portions of the right side of the graph appear to be dominated by borrowers from the $1-24,999 income range. From the bottom graph however, it’s evident that this was due to over plotting issues. After adding a transparency (note that the colors in the legend now represent 10 points instead of 1), jittering the points and cutting off the outliers above the 99th percentile, I see that borrowers earning above $50,000 dominate the right side and borrowers below $50,000 dominate the left side. In the middle, borrowers earning greater than $50,000 make up the majority below a debt to income ratio of about 0.425 and borrowers earning less than $50,000 make up the majority above a debt to income ratio of about 0.425. I said in the bivariate analysis that there were two portions of the graph on the right and left hand side which had a relatively low debt to income ratio. And that on the right side there was likely many borrowers with the highest income range possible and on the left side there was likely many borrowers with the lowest income range possible. This is generally what is seen (if I exclude income range of $0).
Next, I will see how the debt to income range versus credit score lower scatterplot changes with loan status by faceting the graph over loan status. I will also fill the points by income range to see which income range is associated with each loan status. From the bivariate analysis, higher credit scores seemed to be associated with better loan status and lower debt to income ratio seemed to be associated with better loan status. A higher income range was also associated with better loan status. I want to see what the data looks like when I consider all these graphs together.

From the faceted scatter plots shown (note that each color in the legend represents 10 data points), it appears that there are far less borrowers with an income greater than $50,000 in the Chargedoff and Defaulted categories than in the Completed and Current categories. It also looks as if there is around an equal number of borrowers both below and above $50,000 for the Defaulted and Chargedoff categories whereas for the Completed and Current categories, there are more borrowers with over $50,000 in income. These results should be expected as having more income will make it easier to pay back loans. There are far more borrowers with a lower range of credit score above 800 for the Current and Completed loan status than for the Defaulted and Chargedoff loan status. For the Chargedoff and Defaulted categories, there seems to be more borrowers with a lower range of credit score below 700 than above 700. These results too should be expected as borrowers with a lower credit score will have more risk of Defaulting or being Chargedoff. Debt to income ratio coupled with income range seemed to have an effect on loan status. There were far more borrowers making less than $50,000 who had a debt to income ratio greater than 0.25 who were Chargedoff on their loans than those making greater than $50,000. The within group proportion of those who were Chargedoff with debt to income ratio greater than 0.25 seemed to be far greater for those making less than $50,000 than those making greater than $50,000. In summary, it does appear that having a low credit score and income range coupled with a relatively high debt to income ratio makes it more likely that a borrower will end up Defaulting or being charged off with a loan (the probabilities for both, regardless of the variables may still be less than 50% as there are more borrowers in general who fall into the Current and Completed status).
Now I will look at borrower apr versus credit score range lower scatterplots faceted by loan status and filled by income range. Debt to income ratio didn’t have a big effect on loan status but I do think borrower apr will based off the bivariate analysis. This will be a very similar visual as the one previously.

From the faceted scatter plots shown (note that each color in the legend represents 10 data points), it appears first of all that there are many more borrowers falling into the Completed and Current status rather than the Defaulted or Chargedoff status. There is a great deal more borrowers with incomes over $50,000 for both the Completed and Current loan status than for the Defaulted and Chargedoff loan status. Just like the debt to income ratio versus credit score scatter plot, there appears to be many more borrowers with credit scores greater than 800 for the Current and Completed status than for the Defaulted and Chargedoff loan status. There also appears to be more borrowers with credit scores smaller than 700 for both the Defaulted and Chargedoff loan status than for the Completed and Current loan status. In terms of borrower apr, it looks like income range strengthens the relationship between borrower apr and loan status but not as much as debt to income ratio. Within the Chargedoff loan status, there appears to be more borrowers with an apr greater than 0.25 who also have incomes less than $50,000 than those who have incomes greater than $50,000. But there are more borrowers in this range of apr with incomes greater than $50,000 in the Chargedoff status than in the similar range for debt to income ratio. From these scatterplots, it does appear that having a higher income, a lower apr and a high credit score are all indicators that the probability of Defaulting or being Chargedoff on a loan will be minimized.
Next, I will look at bar charts of loan status filled by prosper score and faceted over prosper rating. I will do this to see which combination of prosper score and rating produces the best and worst outcomes of loan status. I am doing this because the bar chart of the best and worst combination in the univariate analysis and the filled bar charts in the bivariate analysis seemed to indicate that there was a relationship between these variables.

The first thing to notice about these faceted bar charts is that there are virtually no loans assigned a rating of AA that belong to the Past Due, Defaulted or Chargedoff loan statuses. Almost all of the loans in the Completed and Current status for AA loans have prosper scores of 8-11. It appears then that for a loan to have a prosper rating of AA is to have virtually a 0% chance of resulting in anything but Current or Completed (this is only for this sample data and more data would be needed to really confirm this). Starting with a prosper rating of D and continuing until HR, the proportion of Chargedoff loans seems to stay constant (around 1.25%) but the proportion of Current and Completed loans continues to decrease. It should be noted that since the proportions shown represent the proportion of all borrowers (and not just those within each group), the proportion of Chargedoff loans actually increases from D through HR within each group even though the proportion of total borrowers remains the same. A prosper rating of D actually has the largest proportion of Completed loans with around an equal number of loans being between 1-5 and 7-11. This proportion is greater than the proportion of prosper rating AA. As I mentioned in the bivariate analysis, this is because prosper assigns far more loans a lower rating rather than AA. Prosper would rather be wrong about a bad loan actually turning out good than a good loan turning out bad. The proportion of prosper loans with status of Completed is greater for AA loans than for D loans within group (this is important to remember). The trend is that the proportion of loans Chargedoff increases from prosper rating of AA through HR and the proportion of a loan Completing increases from a rating of AA to a rating of C but decreases from a rating of D through HR (note that the proportion within groups actually decreases from AA to C but increases from D to HR). Virtually all of the loans assigned a rating of HR had prosper scores between 1 and 4 but there were some loans assigned ratings of C through E that also had prosper scores between 8 and 11. Most of these loans fell into the Current or Completed category but a few fell into the Defaulted or Chargedoff category (this number was really low). For ratings of B through AA, there were some loans assigned scores 1-4 but they almost all fell into categories of Current or Completed. In terms of prosper score and rating strengthening each other, combinations of very good prosper scores and very good prosper ratings do tend to minimize Defaulted and Chargedoff loan statuses but the opposite is not true. Combinations of bad prosper scores and bad prosper ratings don’t necessarily minimize statuses of Completed and Current (rather they maximize statuses of Chargedoff and Defaulted).
Summary Observations about Multivariate Analysis
Features that strengthened each other in terms of looking at the
relationship with the feature of interest.
Prosper score and prosper rating strengthened each other in terms of predicting loan status. Combinations of really good prosper ratings and scores minimized Defaulted and Chargedoff loan status proportions (both within each group and out of the total borrower population) and they maximized within group proportion of Completed loans. Combinations of really bad prosper ratings and scores didn’t necessarily minimize loan status proportions of Completed but they did maximize within group loan status proportions of Chargedoff loan statuses.
I thought that income range and credit score also strengthened each other in terms of predicting loan status. From the borrower apr versus credit score scatter plot, there were very few loans in a status of Defaulted or Chargedoff which had credit scores beyond 800 and also had an income above $50,000. There were also very few borrowers with credit score below 600 and with income range below $50,000 in the Current loan category.
For debt to income ratio, it did seem that for borrowers making less than $50,000, the within income range proportion of them who were Chargedoff and had a debt to income ratio greater than 0.25 was far greater than that of borrowers making greater than $50,000 (most of the borrowers making greater than $50,000 were in the Current or Completed categories and had ratios less than 0.25). This was across all credit scores. The same could be said for borrower apr but the difference in within income range proportions will be smaller between those making less than $50,000 and those making greater than $50,000. Income range and borrower apr seem to strengthen each other, but not as much as income range and debt to income ratio.
Interesting and surprising interactions between features.
I was initially surprised that so many loans with a risky prosper rating and score were able to be Current or Completed (prosper loans with rating of HR actually had a higher proportion of borrowers out of all borrowers with a Completed loan status), but then I realized this was because Prosper assigns many more loans a rating of HR than ratings such as AA. I was also surprised that borrower apr didn’t appear to be strengthened as much by income range as debt to income ratio. I figured the opposite would be true based off the bivariate box plots.
Potential for models in the future
I did not create any models with the dataset. In future analyses, I would be interested in what a logistic regression used to predict loan status using income range, credit score, prosper score and prosper rating would look like.
Final Plots and Summary
Plot One

Description One
The above scatterplot represents Debt to Income Ratio versus Credit Score Range Lower filled by Income Range. There are two primary reasons I included this graph in the final plots. The first is to clear up a possible misconception that the smaller a borrower’s debt to income ratio, the better their credit score will be because it will be easier for them to pay back loans. From this scatterplot, this is obviously not the case. There are several borrowers who have relatively low credit scores who also have debt to income ratios below 0.25. It’s also not true that having a relatively high debt to income ratio will automatically result in bad credit scores (see the number of borrowers above a ratio of 0.5 who hover around the 700 credit score mark). Clearly, more is going on here than just debt to income ratio. It appears that income range is a far better predictor of credit score at least for income ranges above $50000 (most borrowers with income ranges above $50000 have credit scores beyond 700 and very few exist below 600). For that below $50000, other factors must come into play as the data is spread across most of the graph (although there appear to be very few borrowers with income less than $50000 and credit scores beyond 800). The second reason I included this graph was because the original version in the bivariate analysis was what let me know that I needed to consider these two variables separately against loan status and that they weren’t correlated enough to discard one of them (debt to income ratio and credit score range lower).
Plot Two

Description Two
The above box plot represents Credit Score Range Lower versus Loan Status. I chose this box plot because of its importance evident from the two scatterplots faceted over loan status I examined in the multivariate case. From the first scatterplot (debt to income ratio versus low range of credit score), there were far more borrowers with a lower range of credit score above 800 for the Current and Completed loan status than for the Defaulted and Chargedoff loan status. For the Chargedoff and Defaulted categories, there seems to be more borrowers with a lower range of credit score below 700 than above 700. The same was evident in the borrower apr versus lower range of credit score scatterplot. It also appeared as if credit score strengthened the other variables in terms of determining probability for being in the various loan statuses; the probability of having high borrower apr, low credit score, relatively low income and being chargedoff seemed to be greater than what was present in borrowers with low apr, high credit score and relatively high income.
In this box plot, we see the expected outcome that borrowers in the Completed category have on average higher credit scores than those in the Defaulted and Chargedoff categories. The median of those in the Completed category is also greater than that of Defaulted or Chargedoff. This variable seems important and in the future I would like to see it in a logistic regression analysis to predict the various loan statuses of interest.
Plot Three

Description Three
The above visual is a bar chart of proportion of total borrowers in each loan status filled by prosper score and faceted over prosper rating. I included this plot because prosper score and prosper rating are the unique values prosper uses to assess the risk of a loan to potential investors. I thought it was important to see how well prosper does at predicting probabilities associated with the loan statuses of interest (Chargedoff, Defaulted, Current and Completed). I think the results point to prosper achieving their goal of minimizing the probability of a loan having a bad outcome via their rating and scoring system. The within group proportion of borrowers completing a loan are greatest for those with a prosper rating of AA. This rating is for the most credit worthy borrowers. The proportion of borrowers who are Chargedoff increases from a rating of AA to D but then stays constant from D through HR. The within group proportion of those Chargedoff decreases from a rating of AA through C but then increases from C through HR.
Reflection
The exploration I performed was that of determining which variables are associated with success or failure in paying back loans from Prosper. There were 81 total variables in the dataset and I picked a subset of 11 of these variables including the dependent variable loan status. During the univariate analysis, I discovered several things I needed to do to modify the data including combining categories (all the Past Due categories in employment status), removing missing data, and removing missing data from the Prosper Score and Prosper Rating variables to only consider data post 2009. I thought combining categories was a struggle because there were several things I needed to look up online to be able to accomplish this goal. Of all the tasks I performed in the univariate analysis, I thought the task that most went well was looking at loan status for the best and worst combinations of prosper score and prosper rating. It was these charts that confirmed my suspicion that borrowers with the best scores and ratings have higher success rates in paying back the loans and lower failure rates versus the worst combination of prosper rating and scores. I thought the most surprising part of the univariate analysis was the days’ delinquent histograms for the best and worst combination of prosper score and rating. For these histograms, mean days’ delinquent was actually less for the worst combination of prosper score and rating versus the best prosper score and rating.
In the bivariate case, I first looked at various scatterplot matrices of variables plotted against each other. I wanted to see if any of the non-categorical/numerical data could be discarded due to being correlated with each other. Both from the scatterplot matrices and the debt to income ratio versus credit score range lower scatterplot, I found that this was not the case. So I considered all the numerical variables against loan status via box plots. I was surprised that credit score range lower and debt to income ratio did not correlate with each other in a monotonic way. In general, each of the variables I considered went the way I expected as loan status progressed from best outcome to worst. I thought the employment duration went as expected but there didn’t appear to be a large difference between each of the loan statuses so I did not consider it in the multivariate case. Mean credit score range lower decreased with worsening loan status. Mean borrower apr and debt to income ratio increased with worsening loan status.
In the bivariate case for the categorical variables, I saw that whether a borrower is a homeowner and employment status were not very helpful in predicting loan status. The proportions of homeowners and non-homeowners were about even for each loan status and the employment status categories were very vague. It was kind of surprising to me that being a homeowner did not seem to affect loan payback success as owning a home should correlate with good credit history. I did not consider either employment status or is borrower a homeowner in the multivariate case. For the loan status bar chart filled by income range, I did find that having a higher income was associated with less loans being Defaulted/Chargedoff and more loans being in Completed or Current loan status. For those in the lower income ranges, there was still a greater proportion in the Completed and Current loan statuses, but for a loan status of Defaulted, there was a higher proportion of borrowers in the lower income ranges (less than $50,000). For prosper score, a greater proportion of borrowers in the Completed and Current loan status had scores of 7 or greater than those with scores of 5 or lower. The reverse was true for statuses of Defaulted and Chargedoff. For prosper rating, a greater proportion of borrowers in the Completed loan status had ratings of HR through D but the within ratings ratio of Completed to Defaulted/Chargedoff was greater for AA to B than for HR through D. The insights I found with income range and prosper score/rating were all to be expected.
In the multivariate analysis, I first looked at debt to income ratio versus credit score range lower filled by income range. Most of the borrowers with credit scores above 800 had high incomes and low debt to income ratio. Borrowers with relatively low income ranges were more spread out in terms of both credit score and debt to income ratio. For credit scores below 550, borrowers had a wide range of debt to income ratio versus borrowers making high incomes and having credit scores greater than 800. This highlights the importance of income on financial health. I wanted to look at this graph because I wanted to see why there were two portions of borrowers on both ends of the graph who had relatively low debt to income ratios but diverging credit scores. My suspicion was that income range was the difference and I was correct. I next looked at two scatterplots of debt to income ratio versus credit score range lower filled by income range and borrower apr versus credit score range lower filled by income range faceted over loan status. I did this so that I could see how the independent variables impacted loan status. In general, having a higher income, higher credit score and better debt to income ratio and borrower apr led to better loan status outcomes and less of a spread in the data. Finally, I wanted to see how prosper score and prosper rating interacted with each other and to see how they were associated with loan status. The best combination of prosper score and rating produced the best outcomes (Completed and Current) in terms of within group proportions while the worst combination of prosper score and rating produced the worst outcomes in terms of within group proportions (Defaulted and Chargedoff).
In summary, I thought that income, credit score, prosper score and prosper rating were the most important variables for determining loan status. One of the first things that should be done in the future with this project is to perform a logistic regression analysis considering the four variables I mentioned and see how well they predict proportions of borrowers who fall into the various loan categories. I would also like to normalize the days’ delinquent variable by time since loan origination to see if there is a tilt in the favor of borrowers who have the best combination of prosper score and rating versus the worst combination of prosper score and rating. As I’ve mentioned before, a borrower who is halfway through paying back a loan but is 30 days’ delinquent isn’t going to be viewed in the same way as a borrower who has the same days’ delinquent but is only two or three months into paying back a loan.